Paraphrasing 4 Microblog Normalization

Authors

  • Wang Ling
  • Chris Dyer
  • Alan W. Black
  • Isabel Trancoso
Abstract

Compared to the edited genres that have played a central role in NLP research, microblog texts use a more informal register with nonstandard lexical items, abbreviations, and free orthographic variation. When confronted with such input, conventional text analysis tools often perform poorly. Normalization — replacing orthographically or lexically idiosyncratic forms with more standard variants — can improve performance. We propose a method for learning normalization rules from machine translations of a parallel corpus of microblog messages. To validate the utility of our approach, we evaluate extrinsically, showing that normalizing English tweets and then translating improves translation quality (compared to translating unnormalized text) using three standard web translation services as well as a phrase-based translation system trained on parallel microblog data.
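The abstract does not spell out how the rules are learned, so the following is only a minimal sketch of the general idea: given pairs of a noisy message and a more standard paraphrase of it (for example, one obtained by machine-translating the other side of a parallel corpus), count which tokens get rewritten into which. It uses token alignment from Python's `difflib` as a stand-in, not the authors' actual model, and the function names `extract_rules` and `normalize` as well as the toy pairs are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_rules(pairs):
    """Count rewrites between a noisy message and a standard paraphrase.

    Assumption: a plain difflib token alignment stands in for whatever
    alignment model the paper actually uses.
    """
    rules = Counter()
    for noisy, standard in pairs:
        src, tgt = noisy.split(), standard.split()
        matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only "replace" spans describe a nonstandard -> standard rewrite.
            if tag == "replace":
                rules[(" ".join(src[i1:i2]), " ".join(tgt[j1:j2]))] += 1
    return rules


def normalize(message, rules, min_count=1):
    """Rewrite each token with its most frequent single-token rule, if any."""
    best = {}
    for (src, tgt), count in rules.items():
        if " " in src or count < min_count:
            continue  # skip multi-token rules and rare rules
        if src not in best or count > best[src][1]:
            best[src] = (tgt, count)
    return " ".join(best.get(tok, (tok, 0))[0] for tok in message.split())


# Hypothetical toy data: (noisy message, standard paraphrase).
pairs = [
    ("thank u so much", "thank you so much"),
    ("u are gr8 today", "you are great today"),
]
rules = extract_rules(pairs)
print(normalize("u r gr8", rules))
```

In the extrinsic setup the abstract describes, the output of a normalizer like this would be passed to a translation system and compared against translating the unnormalized message.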

Similar resources

LiveTweet: Microblog Retrieval Based on Interestingness and an Adaptation of the Vector Space Model

This paper presents the Institute for Web Science and Technology’s contribution to the TREC2011 Microblog Track. The goal of the Microblog Track is to address the user’s information need in which a user wishes to see not only the most recent but also the most interesting and relevant information to a query in Twitter. In this paper we present the LiveTweet system, submitted by the Institute for...

Accurate Word Segmentation and POS Tagging for Japanese Microblogs: Corpus Annotation and Joint Modeling with Lexical Normalization

Microblogs have recently received widespread interest from NLP researchers. However, current tools for Japanese word segmentation and POS tagging still perform poorly on microblog texts. We developed an annotated corpus and proposed a joint model for overcoming this situation. Our annotated corpus of microblog texts enables not only training of accurate statistical models but also quantitative ...

TREC Microblog 2012 Track: Real-Time Ranking Algorithm for Microblog Ranking Systems

As a matter of fact, Twitter is becoming the new big data container, due to the sharp increase in the number of users and its growing popularity. Moreover, the huge amount of user profiles and rough text data is continuously providing new research challenges. This paper reports our contribution and results to the TREC 2012 Microblog Track. In this particular challenge, each participant is required to...

TREC Microblog 2012 Track: Real-Time Algorithm for Microblog Ranking Systems

As a matter of fact, Twitter is becoming the new big data container, due to the sharp increase in the number of users and its growing popularity. Moreover, the huge amount of user profiles and rough text data is continuously providing new research challenges. This paper reports our contribution and results to the TREC 2012 Microblog Track. In this particular challenge, each participant is required to...

Normalization and Paraphrasing Using Symbolic Methods

We describe an ongoing work in information extraction which is seen as a text normalization task. The normalized representation can be used to detect paraphrases in texts. Normalization and paraphrase detection tasks are built on top of a robust analyzer for English and are exclusively achieved using symbolic methods. Both grammar development rules and information extraction rules are expressed...

Publication date: 2013